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LOW-RANK CORRECTION METHODS EOR ALGEBRAIC DOMAIN 
DECOMPOSITION PRECONDITIONERS * 


RUIPENG LI t and YOUSEF SAAD^ 

Abstract. This paper presents a parallel preconditioning method for distributed sparse linear 
systems, based on an approximate inverse of the original matrix, that adopts a general framework 
of distributed sparse matrices and exploits the domain decomposition method and low-rank cor¬ 
rections. The domain decomposition approach decouples the matrix and once inverted, a low-rank 
approximation is applied by exploiting the Sherman-Morrison-Woodbury formula, which yields two 
variants of the preconditioning methods. The low-rank expansion is computed by the Lanczos pro¬ 
cedure with reorthogonalizations. Numerical experiments indicate that, when combined with Krylov 
subspace accelerators, this preconditioner can be efficient and robust for solving symmetric sparse 
linear systems. Comparisons with other distributed-memory preconditioning methods are presented. 

Key words. Sherman-Morrison-Woodbury formula, low-rank approximation, distributed sparse 
linear systems, parallel preconditioner, incomplete LU factorization, Krylov subspace method, do¬ 
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1. Introduction. Preconditioning distributed sparse linear systems remains a 
challenging problem in high-performance multi-processor environments. Simple do¬ 
main decomposition (DD) algorithms such as the additive Schwarz method [TH [T31 |H1 
mso] are widely used and they usually yield good parallelism. A well-known prob¬ 
lem with these preconditioners is that they often require a large number of iterations 
when the number of domains used is large. As a result, the benefits of increased paral¬ 
lelism is often outweighed by the increased number of iterations. Algebraic MultiGrid 
(AMG) methods have achieved a good success and can be extremely fast when they 
work. However, their success is still somewhat restricted to certain types of problems. 
Methods based on the Schur complement technique such as the parallel Algebraic 
Recursive Multilevel Solver (pARMS) [29], which consist of eliminating interior un¬ 
knowns first and then focus on solving in some ways the interface unknowns, in the 
reduced system, are designed to be general-purpose. The difficulty in this type of 
methods is to find effective and efficient preconditioners for the distributed global 
reduced system. In the approach proposed in the present work, we do not try to solve 
the global Schur complement system exactly or even form it. Instead, we exploit the 
Sherman-Morrison-Woodbury (SMW) formula and a low-rank property to define an 
approximate inverse type preconditioner. 

Low-rank approximations have recently gained popularity as a means to com¬ 
pute preconditioners. For instance, LU factorizations or inverse matrices using the 
Fl-matrix format or the closely related Hierarchically Semi-Separable (HSS) ma¬ 
trix format rely on representing certain off-diagonal blocks by low-rank matrices 
[151 EH ESI mi m] . The main idea of this work is inspired by the recursive Multilevel 
Low-Rank (MLR) preconditioner [57] targeted at SIMD-type parallel machines such 
as those equipped with Graphic Processing Units (GPUs), where traditional ILU-type 
preconditioners have difficulty reaching good performance [5S]. Here, we adapt and 
extend this idea to the framework of distributed sparse matrices via DD methods. 
We refer to a preconditioner obtained by this approach as a DD based Low-Rank 
(DDLR) preconditioner. This paper considers only symmetric matrices. Extensions 
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to the nonsymmetric case are possible and will be explored in our future work. The 
paper is organized as follows: In Section]^ we briefly introduce the distributed sparse 
linear systems and discuss the domain decomposition framework. Section presents 
the two proposed strategies for using low-rank approximations in the SMW formula. 
Parallel implementation details are presented in Sectionj^ Numerical results of model 
problems and general symmetric linear systems are presented in Section and we 
conclude in Section |6] 

2. Background: distributed sparse linear systems. The parallel solution 
of a linear systems of the form 

Ax = b, (2-1) 

where ^ is an n x n large sparse symmetric matrix, typically begins by subdividing 
the problem into p parts with the help of a graph partitioner [g na EH Ea Ea [33]. 
Generally, this consists of assigning sets of equations along with the corresponding 
right-hand side values to subdomains. If equation number i is assigned to a given sub- 
domain, then it is common to also assign unknown number i to the same subdomain. 
Thus, each process holds a set of equations (rows of the linear system) and vector 
components associated with these rows. This viewpoint is prevalent when taking a 
purely algebraic viewpoint for solving systems of equations that arise from Partial 
Differential Equations (PDEs) or general unstructured sparse matrices. 

2.1. The local systems. In this paper we partition the problem using an edge 
separator as is done in the pARMS method for example. As shown in Figure [2T| once 
a graph is partitioned, three types of unknowns appear: (1) Interior unknowns that are 
coupled only with local unknowns; (2) Local interface unknowns that are coupled with 
both external and local unknowns; and (3) External interface unknowns that belong 
to other subdomains and are coupled with local interface unknowns. The rows of the 



Fig. 2.1. A local view of a distributed sparse matrix (left) and its matrix representation (right). 

matrix assigned to subdomain i can be split into two parts: a local matrix Ai that acts 
on the local unknowns and an interface matrix Xi that acts on the external interface 
unknowns. Local unknowns in each subdomain are reordered such that the interface 
unknowns are listed after the interior ones. Thus, each vector of local unknowns Xi is 
split into two parts: a subvector Ui of the internal components followed by a subvector 
yi of the local interface components. The right-hand-side vector hi is conformingly 
split into subvectors fi and pi. When the blocks are partitioned according to this 
splitting, the local system of equations can be written as 



Ai Xi bi 
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Here, is a set of the indices of the subdomains that are neighbors to subdomain i. 
The term EijUj is a part of the product which reflects the contribution to the local 
equations from the neighboring subdomain j. The result of this multiplication affects 
only the local interface equations, which is indicated by the zero in the top part of 


the second term of the left-hand side of (2.2). 


2.2. The interface and Schur complement matrices. The local system (2.2) 
is naturally split in two parts: the first part represented by the term AiXi involves 
only the local unknowns and the second part contains the couplings between the local 
interface unknowns and the external interface unknowns. Furthermore, the second 


row of the equations in (2.2) 


Efui + C,yi -k ^ Eijyj = gi 

J&Ni 


(2.3) 


defines both the inner-domain and the inter-domain couplings. It couples the interior 
unknowns Ui with the local interface unknowns yi and the external ones, yj. An 
alternative way to order a global system is to group the interior unknowns of all the 
subdomains together and all the interface unknowns together as well. The action of 


the operation on the left-hand side of (2.3) on the vector of all interface unknowns, 
i.e., the vector = [yJ, • • • , j/J], can be gathered into the following matrix C, 


C = 


/Cl 

Ei2 

■ Eip^ 

A'21 

C2 . 

■ E2p 

\^Epi 

Ep^2 

■ Cp) 


(2.4) 


Thus, if we reorder the equations so that the it^’s are listed first followed by the j/i’s, 
we obtain a global system which has the following form: 


(2.5) 


Bi 

El \ 


/ui\ 


(h\ 

B2 

E2 


U2 


f2 

Bp 

Ep 


Up 


fp 

El E^ ... El 

c y 

\y/ 


U/ 


where Ei is expanded from Ei by adding zeros and on the right-hand side, = 
\9i ^92 ^ ■ ■ ■ I ffj]- Writing the system in the form (2.21 is commonly adopted in practice 


when solving distributed sparse linear systems, while the form ( |2.5[ ) is more convenient 
for analysis. In what follows, we will assume that the global matrix is put in the form 
of (2.5). The form (2.2) will return in the discussions of Section]^ which deal with 
the parallel implementations. 

We will assume that each subdomain i has di interior unknowns and Si interface 
unknowns, i.e., the length of Ui is di and that of yt is Si. We will denote by s the size 
of y, i.e., s = Si -k S 2 + ■ ■ ■ + Sp. With this notation, each Ei is a matrix of size di x s^. 
The expanded version of this matrix, Ei is of size di x s and its columns outside of 
those corresponding to the unknowns in yi are zero. An illustration for 4 subdomains 
is shown in Figure [272] 


A popular way of solving a global system put into the form of (2.5) is to exploit 
the Schur complement techniques that eliminate the interior unknowns Ui first and 
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Fig. 2.2. An example of a 2-D Laplacian matrix which is partitioned into 4 subdomains and 
reordered according to f.8.5[) (left) and (right) respectively. 


then focus on solving in some way for the interface unknowns. The interior unknowns 
can then be easily recovered by back substitution. Assuming that Bi is nonsingular, 
Ui in (2.21 can be eliminated by means of the first equation: ut = — Eiiji) 


which yields, upon substitution in 


SiVi + ^ =gi- EJB^ ^ fi = 

jeNi 


( 2 . 6 ) 


in which Si is the local Schur complement, 


s,, = c,-eTb-^e,,. 


When written for each subdomain (2.61 yields the global Schur complement system 
that involves only the interface unknown vectors yi and the reduced system has a 
natural block structure. 


/ Si 

Ei2 

■ Eip\ 

(yi\ 


92 

E 21 

S 2 . 

■ E2p 

2/2 


\Epi 

Ep^2 

• Spj 

\yp) 


\9pJ 


Each of the diagonal blocks in this system is the local Schur complement matrix Si, 
which is dense in general. The off-diagonal blocks Eij are identical with those of the 



This defines a preconditioning operation for the global system. In the method pro¬ 
posed in this paper we do not try to solve the global Schur complement system or 
even form it. Instead, an approximate inverse preconditioner to the original matrix 
is obtained by exploiting a low-rank property and the SMW formula. 
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3. Domain decomposition with local low-rank corrections. The coeffi¬ 


cient matrix of the system (2.51 is of the form 

y4 = 


B E 
C 


(3.1) 


where B G 


\E & 


and C G Here, we abuse notation by using the 


same symbol A to represent the permuted version of the matrix in (2.11. The goal of 
this section is to build a preconditioner for the matrix (3.1). 


3.1. Splitting. We begin by splitting matrix A as follows 


and defining the n x s matrix, 


B 


C 


E 


E^ 


E = 


a-^E 

—al 


(3.2) 


(3.3) 


where I is the s x s identity matrix and a is a parameter. Then from (3.2) we 
immediately get the identity. 


B 

E ■ 


■ B + a-'^EE'^ 

0 

E''' 



0 

G + aH _ 


-EE^ 


(3.4) 


A remarkable property is that the operator EE"^ is local in that it does not involve 
inter-domain couplings. Specifically, we have the following proposition. 

Proposition 3.1. Consider the matrix X = EE"'" and its blocks Xij associated 
with the same blocking as for the matrix in (2.5). Then, for 1 < i,j < p we have: 

Xij = 0, for i ^ j 

x„, = e,eT. 


Proof. This follows from the fact that the columns of E associated with different 
subdomains are structurally orthogonal illustrated on the left side of Figure [2^ □ 
Thus, we can write 


A = Aq — EE , Aq = 


B + 


C + a"^! 


(3.5) 


with the matrix E defined in (3.3). From (3.5) and the SMW formula, we can derive 
the expression for the inverse of A. First define, 


G = I - E"'' A(f'E. 


(3.6) 


Then, we have 

A-' = A((' + A(('E{I - E^Aq'E)-'E^A((' = Aq 1 -h A(('EG-'e'^A((\ (3.7) 

G 

Note that the matrix G is often strongly diagonally dominant for matrices arising 
from the discretization of PDFs, and the parameter a can serve to improve diagonal 
dominance in the indefinite cases. 
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3.2. Low-rank approximation to the G matrix. In this section we will 
consider the case when A is symmetric positive definite (SPD). A preconditioner of 
the form 


M-^ = + (A^^E)G-^(E^ Ao^) 


can be readily obtained from (3.7) if we had an approximation G ^ to G Note that 


the application of this preconditioner will involve two solves with Aq instead of only 
one. It will also involve a solve with G which operates on the interface unknowns. 
Let us, at least formally, assume that we know the spectral factorization of E"^ A^^E 

H = E'^Aq^E = UAU^, 


U is unitary, and A is diagonal. From (3.5) we have Aq = A + EE 


where iJ e IK 

and thus Aq is SPD since A is SPD. Therefore, H is at least symmetric positive 
semidefinite (SPSD) and the following lemma shows that its eigenvalues are all less 
than one. 

Lemma 3.2. Let El = E"^ Aq^ E. Assume that A is SPD and the matrix I — H is 
nonsingular. Then we have 0 < A < 1, for each eigenvalue A of El. 


Proof. From (3.7), we have 


are 


E'^A-^E = H + H{E - H)-^H = H{E+{1 - H)-^H) = H{I - H)-^. 

Since A is SPD, E'^A~^E is at least SPSD. Thus, the eigenvalues of H{1 — H)~^ 
nonnegative, i.e., A/(I — A) > 0. So, we have 0 < A < 1. □ 

The goal now is to see what happens if we replace A by a diagonal matrix A. This 
will include the situation when a low-rank approximation is used for G but it can also 
include other possibilities. Suppose that El is approximated as follows: 


H « uKu'^. 

Then, from the SMW formula, the corresponding approximation to G~^ is: 


G 


-1 


G-^ = (/ - UAU^ )■ = / + U[{I - A)-^ - E]U 


(3.8) 


(3.9) 


Note in passing that the above expression can be simplified to U{I — A) . How¬ 
ever, we keep the above form because it will still be valid when U has only k {k < s) 


columns and A is k x k diagonal, in which case we denote by ^ the approximation 


in (3.9). At the same time, the exact G can be obtained as a special case of (3.9) 


where A is simply equal to A. Then we have 


A-i = Aq-i + {Aq^E)G-\E^Aq^), (3.10) 

and the preconditioner 

M-i = Aq-i + {Aq^E)G-\E^Aq^), (3.II) 

from which it follows by subtraction that 

A-i - M-i = (Ao-iA)(G-i - G-^){E^Aq^), 

and therefore, 

AM-i = I - A{Aq^E){G-^ - Gf^XE^AQ^). 
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(3.12) 







A first consequence of (3.12) is that there will be at lease m eigenvalues of AM 


that are equal to one, where m = n — s is the dimension of B in (3.1) or in other 


words, the number of the interior unknowns. From (3.9) we obtain 




-1 


][/’ 


(3.13) 


The simplest selection of A is the one that ensures that the k largest eigenvalues of 
(/ — A)“^ match the largest eigenvalues of (/ — A)“^. This simply minimizes the 


2-norm of (3.13) under the assumption that the approximation in (3.8) is of rank k. 

> Xs 


Assume that the eigenvalues of H are Ai > A 2 > 
diagonal entries Ai of A are selected such that 


A, = 


A. 

0 


if i < k 
otherwise 


This means that the 


(3.14) 


Observe that from (3.13) the eigenvalues of G ^ — Gj, are 


0 if i < k 

(1 — Ai)“^ — 1 otherwise 


Thus, from (3.12) we can infer that k more eigenvalues of AM ^ will take the value 


one in addition to the existing m ones revealed above independently of the choice of 
G~^. Noting that (1 — Ai)“^ — 1 = Ai/(1 — Ai) > 0, since 0 < Ai < 1 and we can say 
that the remaining s — k eigenvalues of AM~^ will be between 0 and 1. Therefore, the 
result in this case is that the preconditioned matrix AM~^ in (3.12) will have m + k 


eigenvalues equal to one, and s — k other eigenvalues between 0 and 1. 

From an implementation point of view, it is clear that a full diagonalization of H 
is not needed. All we need is Uk, the s x k matrix consisting of the first k columns 
of U, along with the diagonal matrix A^ of the corresponding eigenvalues Ai, • • • , Afc. 


Then, noting that (3.9) is still valid with U replaced by Uk and A replaced by A^, we 


can get the approximation Gfc and its inverse directly: 

Gk = I- UkAkUl, Gfc 1 = / + Uk[{I - Afe)-i - I]Ul. 


(3.15) 


It may have become clear to the reader that it is possible to select A so that AM ^ 
will have eigenvalues larger than one. Consider defining A such that 


A,; = 


A. 

e 


if 

if 


i < k 
i > k 


and denote by Gfc g the related analogue of (3.15). Then, from (3.13) the eigenvalues 


of G-i - G-l are 


0 

(i-A)-i-(i-0)-i 
Note that for i > k, we have 


if i < k 
if i > k ' 


1 1 Xi-0 

1-A " ~ (1-A)(i-0)’ 
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(3.16) 
















and these eigenvalues can be made negative by selecting A^+i < 0 < 1 and the choice 
that yields the smallest 2-norm is 0 = A^+i. The earlier definition of in (3.14) that 
truncates the eigenvalues of H to zero corresponds to selecting 0 = 0. 

Theorem 3.3. Assume that A is SPD and 0 is selected so that A^+i < 0 < 1. 
Then the eigenvalues rji of AM~^ are such that, 


1 < r?* < 1 + 


1 


1-0 


\\A^/^A^^E\\l 


(3.17) 


Furthermore, the term |j ^i?|j 2 is bounded from above by a constant: 

\\A^/^A^^E\\l<\. 


Froo/. We rewrite (3.12) as AM ^ = I + A{Aq^ E){Gi.^ — G ^)(i?^Ag or upon 
applying a similarity transformation with A^/^ 

^ j - G-^){E^Aq^A^/^). (3.18) 

From (3.16) we see that for j < k we have Aj(G)r^ — G~^) = 0, and for j > k, 


\-i 


0 < A,(G-^ - G-^) = (1 - 0)-^ - (1 - A,)-^ < (1 - 0) 

This is because 1/(1 — t) is an increasing function and for j > k, we have 0 < Aj < 
Afc+i < 0- The rest of the proof follows by taking the Rayleigh quotient of an arbitrary 


vector X and utilizing (3.18). 

For the second part, first note that ||A^/^Ag^F||| = p (F’^A/'^AAq , where 

p(-) denotes the spectral radius of a matrix. Then, from A = Aq — EE'^, we have 

E^Aq^AAq^E = E'^Aq^E - (f'^Aj/^F) {e'^Aq^E) = M 


Lemma [3.2| states that each eigenvalue A of i7 satisfies 0 < A < 1. Hence, for each 
eigenvalue p of F^Ag ^AAg p = A — A^, which is between 0 and 1/4 for A S [0,1). 
This gives the desired bound ||A^/^Aq < 1/4. □ 


Fig. 3.1. DDLR-1: eigenvalues of AM ^ with 9 = 0 (left) and 9 = (right) using k = b 

eigenvectors for a 900 x 900 2-D Laplacian with 4 subdomains and a = 1. 



liiU 


An illustration of the spectra of AM~^ for the two cases when 0 = 0 and 0 = A/c+i 
with fc = 5 is shown in Figure [TT| The original matrix is a 900 x 900 2-D Laplacian 
obtained from a finite difference discretization of a square domain using 30 mesh 
points in each direction. The number of the subdomains used is 4, resulting in 119 
interface unknowns. The reordered matrix associated with this example were shown 
in Figure [2?^ 

proved that ||A^/^A(/^i?|j 2 does not 


For the second choice 0 = A^+i, Theorem 


3.3 


exceed 1/4, regardless of the mesh size and regardless of a. Numerical experiments will 
show that this term is close to 1/4 for Laplacian matrices. For the case with a = 1, 
0 = Ae « 0.93492 and || A^/^Ag « 0.24996, so that the bound of the eigenvalues 
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of AM ^ given by (3.17) is 4.8413, which is fairly close to the largest eigenvalue, 
which is 4.3581 (cf. the right part of Figure [3A| ). When a = 2, 0 = Ag « 0.93987 and 
11^1/2^0 « 0.25000, so that the eigenvalue bound is 5.1575 whereas the largest 

eigenvalue is 4.6724. When a = 0.5, 0 = Ag « 0.96945 and ||^ 0.24999, 
and thus the bound is 9.1840, compared with the largest eigenvalue 8.6917. Therefore, 
we conclude for this case a = 1 gives the best spectral condition number of the 
preconditioned matrix, which is a typical result for SPD matrices. 

We now address some implementation issues of the preconditioner related to the 
second choice with 6 = A^+i. Again all that is needed are 17^, and 9. We can show 


an analogue to the expression (3.15) in the following proposition. 

holds: 


Proposition 3.4. The following expression for Gf, \ 


G~k,s = 


1 


1 - 9 


I + t/4(/-Afe)-l-(l-0)-l/]l7; 


(3.19) 


Proof We write U = [Uk, W], where Ufc is as before and W contains the remaining 
columns Ufc+i, • • • , Ug. Note that W is not available but we use the fact that WW'^ = 
I — UkUT for the purpose of this proof. With this, (3.9) becomes: 


Gk!e = I + [Uk, ~ ^ [Uk,W]^ 

= I + Uk[{I- Afc)-' - /] Ul + [(1 - 0)-i - 1] (/ - UkUl) 
= + Uk [(/ - Afc)-' - (1 - 9)-^I] Ul. 


Proposition 3.5. Let the assumptions of Lemma\3.!^ be satisfied. The precon¬ 


ditioner (IS.llI) with the matrix G/^ l defined by (|3.19|) is well-defined and SPD when 
9<l. 


Proof. From (3.19), the eigenvalues of G^. g are (1 —Ai) i = 1,..., k or (1 — 9) 


3.2 


0 < Ai < 1 for all i and thus G^ g is well-defined and SPD 


Recall from Lemma 
when 0 < 1. Hence, preconditioner (|3.11|) is SPD. □ 


We refer to the preconditioner (3.11) with G^, ^ = G^ g as the one-sided DDLR 


preconditioner, abbreviated by DDLR-1. 

3.3. Two-sided low-rank approximation. The method to be presented in 


this section uses low-rank approximations for more terms in (3.7), which yields a 


preconditioner that has a simpler form. Compared with the DDLR-1 method, the re¬ 
sulting preconditioner is less expensive to apply and less accurate in general. Suppose 
that G is factored in the form 


= UV'^, 


(3.20) 


as obtained from the singular value decomposition (SVD), where U G and 

V G is orthogonal. Then, for the matrix G in (3.6), we have the following 

lemma. 


Lemma 3.6. Let G = L - E'^A^ ^E as defined by (|3.6[) be nonsingular. Then, 


G-^ =I + V{I-U^EV) ^U'^E. 
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Furthermore, the following relation holds, 


V^G-^V = {I -U^EV) \ 


(3.21) 


, -1 


Proof. For G we can write 

G-i = {I- {E'^Ao^)E)~^ = {I- VU^E)~^ =I + V{I- U^EVy' U^E. 

Relation ( |3.21[ ) follows from 

V'^G-W = V^{I + V{I- U^EVy^ U^E)V ={I- U^EVy^. 

□ 


From (3.201, the best 2-norm rank-/c approximation to Aq E is of the form 


A^E : 


UkVy 


(3.22) 


where Uk € and Vk G with V^Vk = I consist of the first k columns of U 

and V respectively. For an approximation to G, we define the matrix Gk as 


Gk=I-VkmE . 


(3.23) 


Then, the expression oi A ^ in (3.7) will yield the preconditioner: 


M-^ =A^^ + 


y + UkiV,^GyVk)Ul 


This means that we can build an approximate inverse based on a low-rank correction 
of the form that avoids the use of Vk explicitly. 


M-^ =Ay +UkHkUl with Hk = V^Gy 


Vk. 


(3.24) 


Note that Lemma 3.6 will also hold if U and V are replaced with Uk and Vk- As a 
result, the matrix Hk has an alternative expression that is more amenable to compu¬ 
tation. Specifically, we can show the following lemma. 

Lemma 3.7. Let Gk be defined by (3.23) and assume that matrix I — U^EVk is 
nonsingular. Then, 

Gy =I + VkHkU^E with Hk = {I - ulEVk)-^. 

Furthermore, the following relation holds: 

v^ GyVk = Hk 


i.e., the matrix Hk in (3.24.) o-nd the matrix Hk are equal. 

Proof. A proof can be directly obtained from the proof of Lemma [3.6| by replacing 
matrices U,V and G with Uk,Vk and Gk respectively. □ 

The application of (3.24) requires one solve with Aq and a low-rank correction 
with Uk and Hk. Since A(f'^E is approximated on both sides of G in (3.7), we refer 
to this preconditioner as a two-sided DDLR preconditioner and use the abbreviation 
DDLR-2. 

Proposition 3.8. Assume that UkV([ in (3.22) is the best 2-norm rank-k ap¬ 
proximation to Aq^E, and that Aq is SPD. Then the preconditioner given by (3.24) 
is well-defined and SPD if and only if p(U^EVk) < 1. 
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Proof. The proof follows from Proposition 3.2 in m showing the symmetry of 
iJfc, and Proposition 3.4 and Theorem 3.6 in m for the if-and-only-if condition. □ 
Next, we will show that the eigenvalues of the preconditioned matrix AM~^ are 


between zero and one. Suppose that t/feVjr is obtained as in (3.22), so that we have 
Vk = Uk- Then, the preconditioner (|3.24[) can be rewritten as 


M-^ = A 


0 + UkHkUl = A^i + {A^^E) VkHkV^ 


(3.25) 


where U and V are defined in (3.20). We write U = [C/fe,f7] and V = [14,1^], where 
U and V consist of the s — k columns of U and V that are not contained in Uk and 
Vk- Recall that = I — EVk and define X = I — U'^EV, Z = —U'^EV. From 
(3.10) and (3.20), we have 


A-i = Ao 1 + {A-^^E) V (V^G-^V) (E^A^^) , 


from which and (3.21), it follows that 


A-i = A, 


-1 


+ {Ao^E)V {I -U^EV) V^{E^Ag^), 

_-l „x -1 

= A^^ + {A^^E)V 




Let the Schur complement of Hf. ^ be 

Sk=X- Z'^HkZ e 

and define matrix Sk € IR2(s-fc)x2(s-fe) 

Then, the following lemma shows that Sk is SPD and Sk is SPSD. 


(3.26) 


(3.27) 


(3.28) 


Lemma 3.9. Assume that G defined by (3.7) is nonsingular as well as the matrix 


I — UfEVk- Then, the Schur complement Sk defined by (3.27) is SPD. Moreover, 


matrix Sk is SPSD with s — k positive eigenvalues and s — k zero eigenvalues. 

Proof. From Lemma [3.2[ we can infer that the eigenvalues of G are all positive. 
Thus, G is SPD and so is matrix V"^G~^V. In the end, the Schur complement Sk is 
SPD when Hk is nonsingular. The signs of the eigenvalues of Sk can be easy revealed 
by a block LDL factorization. □ 

Theorem 3.10. Assume that A is SPD. Then the eigenvalues rji of AM~^ with 
M~^ given by (3.24) satisfy 0 < rji < 1. 


Proof. From (3.25) and (3.26), it follows by subtraction that 


A-i - M-i = {A^^E) V 


-1 


(Hk 

\Z'^ X 

Y-1 ryT 


-1 


-(A-^E)V ^ 

-[Ao E)V^ -S-^Z^Hk 


= (Afi^E) V 


(HkZ 

\ 0 


.-1 ) Sk 


Hk 0 
0 0 


-HkZSZ^ 


Z^Hk 


(E^A^^) , 
I/^ (E^A^^) , 

P- (Fl-A-) 

k J 
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where Sk is defined in (3.28), so that 


AM-^ = I - A{Aq^E)V 


HkZ 0 \ ^ fZ'^Hk 0 
0 S', 




Hence, the eigenvalues of AM rji, satisfy 0 < r]i < 1, since Sk is SPSD, and 
{n — s + k) of these eigenvalues are equal to one. □ 


Fig. 3.2. DDLR-2: eigenvalues of AM ^ with k = 5 eigenvectors for a 900 X 900 2-D Laplacian 
with 4 subdomains and a = 1. 



m 


The spectrum of AM~^ for the same matrix used for Figure 3.1 is shown in 
Figure |3.2| Compared with the spectrum with the DDLR-1 method with d = 0 
shown in the left part of Figure |3.1[ the eigenvalues of the preconditioned matrix 
AM~^ are more dispersed between 0 and 1 and the small eigenvalues are closer to 
zero. This suggests that the quality of the DDLR-2 preconditioner will be lower than 
that of DDLR-1, which is supported by the numerical results in Section 

4. Implementation. In this section, we address the implementation details for 
building and applying the DDLR preconditioner, especially focusing on the imple¬ 
mentations in a parallel/distributed environment. 

4.1. Building a DDLR preconditioner. The construction of a DDLR pre¬ 
conditioner involves the following steps. In the first step, a graph partitioner is called 
on the adjacency graph to partition the domain. For each obtained subdomain, we 
separate the interior nodes and the interface nodes, and reorder the local matrix into 
the form of (2.2). The second step is to build a solver for each Bi a = Bi + a~^EiEj. 


These two steps can be done in parallel. The third step is to build a solver for the 
global matrix Ca- We will focus on the solution methods for the linear systems with 
Ca in Section 4.3 The last step, which is the most expensive one, is to compute the 


low-rank approximations. This will be discussed in Section 4.4 


4.2. Applying the DDLR preconditioner. First, consider the DDLR-1 pre¬ 


conditioner (3.11), which we can rewrite as 


M 


-1 


= [l + EGl^,E 


— 1 ipT A — 1 


(4.1) 


The steps involved in applying M~^ to a vector x are listed in Algorithm The 
vector u resulting from the last step will be the desired vector u = M~^x. The solve 
with Aq required in steps 1 and 5 of Algorithm can in turn be viewed as consisting 
of the p independent local solves with Bi^a and the global solve with Ca = C + a^I as 
is inferred from (3.5). Recall that the matrix C, which has the block structure (2.4), is 


the global interface matrix that couples all the interface unknowns. So, solving a linear 
system with Ca will require communication if Ca is assigned to different processors. 
The multiplication with in step 2 transforms a vector of the interior unknowns 
into a vector of the interface unknowns. This can be likened to a descent operation 
that moves objects from a “fine” space to a “coarse” space. The multiplication with 
E in step 4 performs the reverse operation, which can be termed an ascent operation. 
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consisting of going from the interface unknowns to the interior unknowns. Finally, 
the operation with in step 3 involves all the interface unknowns, and it will also 
require communication. In summary, there are essentially 4 types of operations: (1) 
the solve with (2) the solve with Gq,; (3) products with E and which are 
dual of one another; and (4) the application of G^g to vectors. 


Algorithm 1 Preconditioning operations of the DDLR-1 preconditioner. 


Solve: Aqz = X 
Compute: y = E'^z 
Compute: w = G'^^y 
Compute: v = Ew 
Solve: Aqu = x + v 


{Bi^a solves and Ca solve} 
{Interior unknowns to interface neig hbors ) 

(Use tA9h} 

{Interface unknowns to interior neighbors) 
{Bi^a solves and Ca solve) 


Next, consider the DDLR-2 preconditioner given by (3.24). Applying this precon¬ 
ditioner is much simpler, which consists of one solve with Aq and a low-rank correction. 
Communication will be required for applying the low-rank correction term, U}.H}JJ"^, 
to a vector because it involves all the unknowns. We assume that the k x k matrix 
Elk is stored on every processor. 

Parallel implementations of the DDLR methods will depend on how the interface 
unknowns are mapped to processors. A few of the mapping schemes will be discussed 
in Section KSl 


4.3. The global solves with Cq. This section addresses the solution methods 
for Ca required in both the DDLR-1 and the DDLR-2 methods whenever solving 
a linear system with Aq is needed. It is an important part of the computations, 
especially for DDLR-1 as it takes place twice for each iteration. In addition, it is a 
non-local computation and can be costly due to the communication. An important 
characteristic of Ca is that it can be made strongly diagonally dominant by selecting 
a proper scaling factor a. Therefore, the first approach one can think about is to 
use a few steps of the Chebyshev iterations. The Chebyshev method was used with a 
block Jacobi preconditioner Da consisting of all the local diagonal blocks G^ (see, e.g., 
[HI §2.3.9] for the preconditioned Chebyshev method). An appealing property in the 
Chebyshev iterations is that no inner product is needed. This avoids communications 
among processors, which makes this method efficient in particular for distributed 
memory architectures [35]. The price one pays for avoiding communication is that 
this method requires enough knowledge of the spectrum. Therefore, prior to the 
Chebyshev iterations, we performed a few steps of the Lanczos iterations on the 
matrix pair (Gq, Dq.) (SS] §9.2.6] for some estimates (not bounds) of the smallest and 
the largest eigenvalues. The safeguard terms used in were included in order to 
have bounds of the spectrum (see m §13.2] for the definitions of these terms). 

Another approach is to resort to an approximate inverse X « C~^, so that the 
solve with Gq will be reduced to a matrix vector product with X. A simple scheme 
known as the method of Hotelling and Bodewig [20] is given by the iteration 


Xk+l=Xk{2I-CaXk). 


In the absence of dropping, this scheme squares the residual norm Jjl — Gq^^,]] from 
one step to the next, so that it converges quadratically provided that the initial guess 
Xq is such that ]]/ —GqAoJI < 1 for some matrix norm. The global self-preconditioned 
minimal residual (MR) iterations were shown to have superior performance [10| . We 
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adopted this method to build an approximate inverse of Cq,. Given an initial guess Xq, 
the self-preconditioned MR iterations can be obtained by the sequence of operations 
shown in Algorithm Xq was selected as the inverse of the diagonal of Ca ■ The 
numerical dropping was performed by a dual threshold strategy based on a drop 
tolerance and a maximum number of nonzeros per column. 


Algorithm 2 Self-preconditioned global MR iterations with dropping. 

1 : Compute: Rk = I — CaXk {residual} 

2 : Compute: Zk = XkRk (self-preconditioned residual} 

3: Apply numerical dropping to Zk 

4: Compute: /3k = tr{R'[CaZk)/ IlC'aZfell^ {tr(-) denotes the trace} 

5: Compute: Xk+i = Xk + /3kZk 


4.4. Computation of low-rank approximations. For the DDLR-1 method, 
we use the Lanczos algorithm [21] to compute the low-rank approximation to E'^ 
that is of the form UkAkU^■ For the DDLR-2 method, the low-rank approximation to 
Aq^E is of the form UkVk , which can be computed by applying the Lanczos algorithm 
on where 14 is computed and Uk can be obtained by Uk = A^^EVk- 

Alternatively, for the DDLR-2 method, we can also use the Lanczos bidiagonalization 
method m § 10.4] to compute Uk and 14 at the same time. At each step of the 
Lanczos algorithm, a matrix-vector product is required. This means that for each 
step, we need to solve linear systems with Aq: one solve for the DDLR-1 method and 
two for the DDLR-2 method. 

As is well-known, in the presence of rounding error, orthogonality in the Lanczos 
procedure is quickly lost and a form of reorthogonalization is needed in practice. In 
our approach, the partial reorthogonalization scheme muss] is used. The cost of 
this step will not be an issue to the overall performance when a small number of 
steps are performed to approximate a few eigenpairs. To monitor convergence of the 
computed eigenvalues, we adopt the approach used in [TB]. Let and be 

the Ritz values obtained in two consecutive Lanczos steps, m — 1 and m. Assume that 
we want to approximate k largest eigenvalues and k < m. Then with a preselected 
tolerance e, the desired eigenvalues are considered to have converged if 


^m—1 

^m—1 


< e, where Um-i 


and (Tm 

1=1 


k 

m) 



(4.2) 


4.5. Parallel implementations: standard mapping. Considerations of the 
parallel implementations have been mentioned in the previous sections, which suggest 
several possible schemes for distributing the interface unknowns. Before discussing 
these schemes, it will be helpful to overview the issues at hand. Major computations 
in building and applying the DDLR preconditioners are the following: 


1. solve with Bi^a, (local) 

2. solve with Ca, (nonlocal) 

3. products with E'^ and E, (local) 

4a. for DDLR-1, applying in (3.19), (nonlocal) 

4b. for DDLR-2, products with Uk and t/J, (nonlocal) 

5. reorthogonalizations in the Lanczos procedure. (nonlocal) 


The most straightforward mapping we can consider might be to map the unknowns of 
each subdomain to a processor. If p subdomains are used, global matrices A and Ca 
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or its approximate inverse X are distributed among the p processors. So, processor i 
will hold di + Si rows of A and Si rows of Cq or X, where di is the number of the local 
interior unknowns and Si is the number of the local interface unknowns of subdomain i. 
In the DDLR-1 method, Uk S is distributed such that processor i will keep Si 

rows, while in the DDLR-2 method, di + Si rows of Uk & will reside in processor 

i. For all the nonlocal operations, communication is among all the p processors. The 
operations labeled by (2.) and (4a.) involve interface to interface communication, 
while the operations (4b.) and (5.) involve communication among all the unknowns. 
From another perspective, the communication in (4a.), (4b.) and (5.) is of the all¬ 
reduction type required by vector inner products, while the communication in (2.) is 
point-to-point such as that in the distributed sparse matrix vector products. If an 
iterative process is used for the solve with C^, it is important to select a carefully 
so as to reach a compromise between the number of the inner iterations (each of 
which requires communication) and the number of the outer iterations (each of which 
involves solves with Ca)- The scalar a will also play a role if an approximate inverse 
is used, since the convergence of the MR iterations will be affected. 


4.6. Unbalanced mapping: interface unknowns together. Since commu¬ 
nication is required among the interface nodes, an idea that comes to mind is to map 
the interior unknowns of each subdomain to a processor, and all the interface un¬ 
knowns to another separated one. In a case of p subdomains, p -I- 1 processors will be 
used and A is distributed in such a way that processor i owns the rows correspond¬ 
ing to the local interior unknowns for i = 1,.. .p, while processor p -|- 1 holds the 
rows related to all the interface unknowns. Thus, Ca or X will reside entirely on the 
processor p -I- 1. 

A clear advantage of this mapping is that the solve with Ca will require no 
communication. However, the operations with E and are no longer local. Indeed, 

can be viewed as a restriction operator, which “scatters” interface data from 
processor p-|- 1 to the other p processors. Specifically, referring to (2.21, each pi will 
be sent to processor i from processor p -I- I. Analogously, the product with E, as a 
prolongation, will perform a dual operation that “gathers” from processors 1 to p to 
processor p-l-1. In Algorithm[IJ the scatter operation goes before step 2 and the gather 
operation should be executed after step 4. Likewise, if we store the vectors in Uk on 
processor p-l- 1, applying Gk,e will not require communication but another pair of the 
“gather-and-scatter” operations will be needed before and after step 3. Therefore, at 
each application of the DDLR-1 preconditioner, two pairs of the scatter-and-gather 
operations for the interface unknowns will be required. A middle ground approach is 
to distribute Uk to processors 1 to p as it is in the standard mapping. In this way, 
applying Ck^g will require communication but only one pair of the scatter-and-gather 
operations is necessary. On the other hand, in the DDLR-2 method, the distribution 
of Uk should be consistent with that of A. 

The main issue with this mapping is that it is hard to achieve load balancing in 
general. Indeed for a good balancing, we need to have the interior unknowns of each 
subdomain and all the interface unknowns of roughly the same size. However, this 
is difficult to achieve in practice. The load balancing issue is further complicated by 
the fact that the equations needed to be solved on processor p-l-1 are completely 
different from those on the other processors. A remedy to the load balancing issue 
is to use q processors instead of just one dedicated to the global interface (a total of 
p + q processors used in all), which provides a compromise. Then, the communication 
required for solving with Ca and applying Ck,g is confined within the q processors. 
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4.7. Improving a given preconditioner. One of the main weaknesses of stan¬ 
dard, e.g., ILU-type, preconditioners is that they are difficult to update. For example, 
suppose we compute a preconditioner to a given matrix and find that it is not ac¬ 
curate enough to yield convergence. In the case of ILU we would have essentially to 
start from the beginning. For DDLR, improving a given preconditioner is essentially 
trivial. For example, the heart of DDLR-1 consists of obtaining a low-rank approxi¬ 


mation the matrix G defined in (3.6). Improving this approximation would consist in 


merely adding a few more vectors (increasing k) and this can be easily achieved in a 
number of ways without having to throw away the vectors already computed. 

5. Numerical experiments. The experiments were conducted on Itasca, an 
HP ProLiant BL280c G6 Linux cluster at Minnesota Supercomputing Institute, which 
has 2,186 Intel Xeon X5560 processors. Each processor has four cores, 8 MB cache, 
and communicates with memory on a QuickPath Interconnect (QPI) interface. An 
implementation of the DDLR preconditioners was written in C/C-I--I- with the Intel 
Math Kernel Library, the Intel MPI library and PETSc [3 HI H] , compiled by the Intel 
MPI compiler using the -03 optimization level. 

The accelerators used were the conjugate gradient (CG) method when both the 
matrix and the preconditioner are SPD, and the generalized minimal residual (GM- 
RES) method with a restart dimension of 40, denoted by GMRES(40), for the indefi¬ 
nite cases. Three types of preconditioners were compared in our experiments: 1) the 
DDLR preconditioners, 2) the pARMS method [23, and 3) the RAS preconditioner 
|8] (with overlapping). Recall that for an SPD matrix, the DDLR preconditioners 


given by (3.11) and (3.24) will also be SPD if the assumptions in Propositions 3.5 and 


|3.8| are satisfied. However, these propositions will not hold when the solves with Aq 
are approximate, which is typical in practice. Instead, the positive definiteness can 
be determined by checking if the largest eigenvalue is less than one for DDLR-1 or by 
checking the positive definiteness of Hk for DDLR-2. DDLR-1 was always used with 
S = Afc+i. 

Each Bi^a was reordered by the approximate minimum degree ordering (AMD) 
[niKii] to reduce fill-ins and then we simply used an incomplete Cholesky or LDL 
factorization as the local solver. A more efficient and robust local solver, for exam¬ 
ple, the ARMS approach in [3S]) can lead to better performance in terms of both 
the memory requirement and the speed. However, this has not been implemented in 
our current code. A typical setting of the scalar a for Ca and Bi a is a = 1, which 
in general gives the best overall performance, the exceptions being the three cases 
shown in Section [5.2[ for which choosing a > 1 improved the convergence. Regarding 
the solves with Ca, using the approximate inverse is generally more efficient than 
the Ghebyshev iterations, especially in the iteration phase. However, computing the 
approximate inverse can be costly, in particular for the indefinite 3-D cases. The 
standard mapping was adopted unless specially stated, which in general gave bet¬ 
ter performance than the unbalanced mapping. The behavior of these two types of 
mappings will be analyzed by the results in Table |5.3[ In the Lanczos algorithm, the 


convergence was checked every 10 iterations and the tolerance e in (4.2) used for the 
checking was 10“^. In addition, the maximum number of the Lanczos steps was five 
times the number of the requested eigenvalues. 

For pARMS, the ARMS method was used to be the local preconditioner and the 
Schur complement method was used as the global preconditioner, where the reduced 
system was solved by a few inner Krylov subspace iterations preconditioned by the 
block-Jacobi preconditioner. For the details of these options in pARMS, we refer 
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the readers to [371 EH]. We point out that when the inner iterations are enabled, 
flexible Krylov subspace methods will be required for the outer iterations, since the 
preconditioning is no longer fixed from one outer iteration to the next. So, the flexible 
GMRES El] was used. For the RAS method, ILU(fc) was used as the local solver, 
and a one-level overlapping between subdomains was used. Note that the RAS pre¬ 
conditioner is nonsymmetric even for a symmetric matrix, so that GMRES was used 
with it. 

We first report on the results of solving the linear systems from a 2-D and a 3-D 
PDFs on regular meshes. Next, we will show the results for solving a sequence of 
general sparse symmetric linear systems. For all the problems, a parallel multilevel 
fc-way graph partitioning algorithm from ParMETiS [2T1 [22] was used for the DD. 
Iterations were stopped whenever the residual norm had been reduced by 6 orders of 
magnitude or the maximum number of iterations allowed, which is 500, was exceeded. 
The results are shown in Tables 5.1 5.2 and 5.5 where all timings are reported in 
seconds and ‘F’ indicates non-convergence within the maximum allowed number of 
steps. When comparing the preconditioners, the following factors are considered: 1) 
fill-ratio, i.e., the ratio of the number of nonzeros required to store a preconditioner 
to the number of nonzeros in the original matrix, 2) time for building preconditioners, 
3) the number of iterations and 4) time for the iterations. 

5.1. Model problems. We examine a 2-D and a 3-D PDF, 

—Au — cu = f in D, 

M = 0 on 912, 


(5.1) 


and 912 is the boundary. We take the 5-point (or 


where 12 = (0,1)^ or 12 = (0,1)^, 

7-point) centered difference approximation. To begin with, we solve (5.1) with c = 0. 
The matrix is SPD, so that we use DDLR with CG. Numerical experiments were 
carried out to compare the performance of DDLR with those of pARMS and RAS. 
The results are shown in Table 5.1 The mesh sizes, the number of processors (Np), 
the rank (rk), the fill-ratios (nz), the numbers of iterations (its), the time for building 
the preconditioners (p-t) and the time for iterations (i-t) are tabulated. We tested 
the problems on 6 2-D and 6 3-D meshes of increasing sizes, where the number of 
processors was growing proportionally such that the problem size on each processor 
was kept roughly the same. This can serve as a weak scaling test. We increased the 
rank k used in DDLR with the meshes sizes. The fill-ratios of DDLR-1 and pARMS 
were controlled to be roughly equal, whereas the fill of DDLR-2 was much higher, 
which comes mostly from the matrix Uk when k is large. For pARMS, the inner 
Krylov subspace dimension used was 3. 

The time for building DDLR is much higher and it grows with the rank and 
the number of the processors. In contrast, the time to build pARMS and RAS is 
roughly constant. This set-up time for DDLR is typically dominated by the Lanczos 
algorithm, where solves with Bi a and Ca are required at each iteration. Moreover, 
when k is large, the cost of reorthogonalization becomes significant. As shown in Table 
|5.1| DDLR-1 and pARMS were more robust as they succeeded for all the 2-D and 
3-D cases, while DDLR-2 failed for the largest 2-D case and RAS failed for the three 
largest ones. For most of the 2-D problems, DDLR-l/GG achieved convergence in the 
fewest iterations and the best iteration time. For the 3-D problems, DDLR-1 required 
more iterations but a performance gain was still achieved in terms of the reduced 
iteration time. Exceptions were the two largest 3-D problems, where RAS/GMRES 
yielded the best iteration time. 
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Table 5.1 

Comparison between DDLR, pARMS and RAS preconditioners for solving SPD linear systems 
from the 2-D/3-D PDE with the CG or GMRES(40) method. 


Mesh 

Np 

rk 

DDLR-1 
nz its p-t 

i-t 

rk 

DDLR-2 
nz its p-t 

i-t 

12^ 

2 

8 

6.6 

15 

.209 

.027 

8 

8.2 

30 

.213 

.031 

2502 

8 

16 

6.6 

34 

.325 

.064 

16 

9.7 

69 

.330 

.083 

512^ 

32 

32 

6.8 

61 

.567 

.122 

32 

13.0 

132 

.540 

.194 

10242 

128 

64 

7.0 

103 

1.12 

.218 

64 

19.3 

269 

1.03 

.570 

14482 

256 

91 

7.2 

120 

1.67 

.269 

91 

24.7 

385 

1.72 

1.05 

20482 

512 

128 

7.6 

168 

3.02 

.410 

128 

32.2 

F 

- 

- 

253 

2 

8 

7.2 

11 

.309 

.025 

8 

8.3 

17 

.355 

.021 

50^ 

16 

16 

7.5 

27 

.939 

.064 

16 

9.3 

52 

.958 

.076 

643 

32 

16 

7.4 

36 

1.06 

.089 

16 

9.2 

67 

1.07 

.102 

1003 

128 

32 

8.0 

52 

1.57 

.136 

32 

11.5 

101 

1.48 

.190 

1263 

256 

32 

8.2 

65 

2.07 

.178 

32 

12.5 

126 

1.87 

.265 

1593 

512 

51 

8.7 

85 

2.92 

.251 

51 

14.2 

156 

2.50 

.387 


Mesh 

Np 

nz 

pARMS 

its 

p-t 

i-t 

nz 

RAS 

its 

p-t 

i-t 

12^ 

2 

6.7 

15 

062 

.037 

2.7 

40 

.003 

.032 

256^ 

8 

6.7 

30 

066 

.082 

2.7 

102 

.004 

.072 

512^ 

32 

6.9 

52 

072 

.194 

2.7 

212 

.005 

.157 

10242 

128 

6.6 

104 

100 

.359 

2.7 

F 

.008 

- 

14482 

256 

6.6 

247 

073 

.820 

2.7 

F 

.011 

- 

20482 

512 

6.8 

282 

080 

1.06 

2.7 

F 

.015 

- 

2^ 

2 

7.3 

9 

100 

.032 

5.9 

13 

.004 

.041 

50^ 

16 

8.1 

17 

179 

.095 

6.7 

28 

.006 

.071 

643 

32 

8.2 

20 

142 

.121 

6.7 

34 

.007 

.103 

1003 

128 

8.3 

29 

170 

.198 

6.7 

51 

.011 

.148 

1263 

256 

8.4 

34 

166 

.216 

6.7 

60 

.014 

.127 

1593 

512 

8.5 

40 

179 

.275 

6.7 

83 

.019 

.183 


Next, we consider solving symmetric indefinite problems by setting c > 0 in (5.1) 


which corresponds to shifting the discretized negative Laplacian by subtracting al 
with a certain cr > 0. In this set of experiments, we reduce the size of the shift as 
the problem size increases in order to make the problems fairly difficult but not too 
difficult to solve for all the methods. We used higher ranks in the two DDLR methods 
and a higher inner iteration number, which was 6, in pARMS. Results are reported 
in Table |5.2[ From there we can see that DDLR-2 did not perform well as it failed 
for almost all the problems. Second, RAS failed for all the 2-D cases and three 3-D 
cases. But for the three cases where it worked, it yielded the best iteration time. 
Third, DDLR-1 achieved convergence in all the cases whereas pARMS failed for two 
2-D cases. Comparison between DDLR-1 and pARMS shows a similar result as in the 
previous set of experiments: for the 2-D cases, DDLR-1 required fewer iteration and 
less iteration time, while for the 3-D cases, it might require more iterations but still 
less iteration time. 


In all the previous tests, DDLR-1 was used with the standard mapping. In the 
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Table 5.2 

Comparison between DDLR, pARMS and RAS preconditioners for solving symmetric indefinite 
linear systems from the 2-D/3-D PDEs with the GMRES(40) method. 


Mesh 

Np 

a 

rk 

DDLR-1 
nz its p-t 

i-t 

rk 

DDLR-2 
nz its p-t 

i-t 

128^ 

2 

le-l 

16 

6.8 

18 

.233 

.034 

16 

13.2 

146 

.310 

.234 

256^ 

8 

le-2 

32 

6.8 

38 

.674 

.080 

16 

13.0 

F 

1.01 

- 

512^ 

32 

le-3 

64 

7.1 

48 

1.58 

.105 

64 

19.4 

F 

1.32 

- 

1024^ 

128 

2e-4 

128 

7.6 

68 

4.15 

.160 

128 

32.3 

F 

4.45 

- 

1448^ 

256 

5e-5 

182 

8.1 

100 

7.14 

.253 

182 

43.2 

F 

7.77 

- 

2048^ 

512 

2e-5 

256 

8.8 

274 

12.6 

.749 

256 

58.4 

F 

13.1 

- 

25^ 

2 

.25 

16 

8.3 

29 

.496 

.099 

16 

9.6 

62 

.595 

.130 

50® 

16 

7e-2 

32 

8.2 

392 

1.38 

1.19 

32 

10.2 

F 

1.66 

- 

64® 

32 

3e-2 

64 

8.9 

201 

2.26 

.688 

64 

16.3 

F 

2.08 

- 

100® 

128 

2e-2 

128 

11.4 

279 

5.17 

1.08 

128 

28.7 

F 

5.29 

- 

126® 

256 

7e-3 

128 

12.8 

255 

5.85 

1.10 

128 

28.3 

F 

6.01 

- 

159® 

512 

5e-3 

160 

13.5 

387 

8.60 

1.71 

160 

33.0 

F 

8.33 

- 


Mesh 

Np 

a 

pARMS 
nz its 

p-t 

i-t 

nz 

RAS 

its 

p-t 

i-t 

128^ 

2 

le-l 

11.4 

76 

.114 

.328 

2.7 

F 

.003 

- 

256® 

8 

le-2 

13.9 

F 

- 

- 

2.7 

F 

.004 

- 

512® 

32 

le-3 

12.3 

298 

.181 

1.53 

2.7 

F 

.005 

- 

1024® 

128 

2e-4 

12.5 

232 

.230 

1.46 

2.7 

F 

.008 

- 

1448® 

256 

5e-5 

12.5 

F 

- 

- 

2.7 

F 

.011 

- 

2048® 

512 

2e-5 

12.6 

314 

.195 

2.13 

2.7 

F 

.015 

- 


2 

.25 

8.3 

100 

.156 

.599 

5.9 

108 

.004 

.123 

50® 

16 

7e-2 

8.9 

448 

.142 

2.59 

6.7 

F 

.006 

- 

64® 

32 

3e-2 

8.9 

130 

.115 

.784 

6.7 

252 

.007 

.375 

100® 

128 

2e-2 

9.4 

187 

.137 

1.24 

6.7 

343 

.011 

.541 

126® 

256 

7e-3 

10.6 

340 

.137 

2.74 

6.7 

F 

.014 

- 

159® 

512 

5e-3 

10.8 

329 

.148 

2.85 

6.7 

F 

.019 

- 


next set of experiments, we examined the behavior of the unbalanced mapping dis¬ 
cussed in Section |4.6[ In these experiments, we tested the problem on a 128 x 128 
mesh and a 25 x 25 x 25 mesh. Both of them were divided into 128 subdomains. 
Note here that the problem size per processor is remarkably small. This was made on 
purpose since it can make the communication cost more significant (and likely to be 
dominant) in the overall cost for the solve with Ca such that it can make the effect 
of the unbalanced mapping more prominent. Table |5.3| lists the iteration time for 
solving the SPD problem using the standard mapping and the unbalanced mapping 
with different settings. Two solution methods for Ca were tested, the one with the 
approximate inverse and the preconditioned Chebyshev iterations (5 iterations were 
used per solve). In the unbalanced mapping, q processors were used dedicated to the 
interface unknowns {p q processors were used totally). The unbalanced mapping 
was tested with 8 different q values from 1 to 96. g = 1 is a special case where no 
communication is involved in the solve with Ca- The matrix Up was stored on the 
p processors, so that only one pair of the scatter-and-gather communication was re¬ 
quired at each outer iteration as discussed in Section [476| The standard mapping is 
indicated by g = 0. 

As the results indicated, the iteration time kept decreasing at beginning as g 
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Table 5.3 

Comparison of the iteration time (in milliseconds) between the standard mapping and the un¬ 
balanced mapping for solving 2-D/3-D SPD PDE problems by the DDLR-l-CG method. 


Mesh 


9 = 0 

1 

2 

4 

8 

16 

32 

64 

96 


AINV 

9.0 

53.4 

27.8 

15.4 

13.9 

10.8 

9.1 

9.4 

13.5 


Cheb 

9.2 

116.8 

60.7 

29.9 

15.8 

10.7 

10.0 

8.3 

9.1 

ocS 

AINV 

15.1 

119.7 

66.3 

34.7 

26.0 

19.2 

17.2 

14.8 

18.2 


Cheb 

13.8 

368.3 

166.0 

78.3 

38.5 

20.4 

14.4 

12.3 

13.5 


increased but after some point it started to increase. This is a typical situation corre¬ 
sponding to the balance between communication and computation: when q is small, 
the amount of computation on each of the q processors is high and it dominates the 
overall cost, so that the overall cost will keep being reduced as q increases until the 
point when the communication cost starts to affect the overall performance. The 
optimal numbers of the interface processors that yielded the best iteration time are 
shown in bold in Table [53| For these two cases, the optimal iteration time with the 
unbalanced mapping was slightly better than that with the standard mapping. How¬ 
ever, we need to point out that this is not a typical case in practice. For all the other 
tests in this section, we used the standard mapping in the DDLR-1 preconditioner. 


5.2. General matrices. We selected 12 symmetric matrices from the University 
of Florida sparse matrix collection [T^] for the following tests. Table 5.4 lists the name, 
the order (N), the number of nonzeros (NNZ), and a short description for each matrix. 
If the actual right-hand side is not provided, an artificial one was created as b = Ae, 
where e is a random vector. 


Table 5.4 

Names, orders (N), numbers of nonzeros (NNZ) and short descriptions of the test matrices. 


MATRIX 

N 

NNZ 

DESGRIPTION 

Andrews/Andrews 

60,000 

760,154 

computer graphics problem 

UTEP/Dubcova2 

65,025 

1,030,225 

2-D/3-D PDE problem 

Rothberg/cfdl 

70,656 

1,825,580 

GED problem 

Schmid / thermall 

82,654 

574,458 

thermal problem 

Rothberg/cfd2 

123,440 

3,085,406 

GED problem 

UTEP/Dubcova3 

146,689 

3,636,643 

2-D/3-D PDE problem 

Botonakis/thermo _TK 

204,316 

1,423,116 

thermal problem 

Wissgott/paraJem 

525,825 

3,674,625 

GED problem 

CEMW/tmt_sym 

726,713 

5,080,961 

electromagnetics problem 

McRae / ecology2 

999,999 

4,995,991 

landscape ecology problem 

McRae / ecologyl 

1 ,000,000 

4,996,000 

landscape ecology problem 

Schmid / thermal2 

1,228,045 

8,580,313 

thermal problem 


Table |5.5| shows the results for each problem. DDLR-1 and DDLR-2 were used 
with GMRES(40) for three problems tmt_sym, ecologyl and ecology2, where the 
preconditioners were found not to be SPD, while for the other problems CG was 
applied. We set the scalar a = 2 for two problems ecologyl and ecology2, where 
it turned out to reduce the numbers of iterations, but for elsewhere we use a = 1. 
As shown by the results, DDLR-1 achieved convergence for all the cases, whereas the 
other three preconditioners all had failures for a few cases. Similar to the experimental 
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results for the model problems, the DDLR preconditioners required more time to 
construct. Compared with pARMS and RAS, DDLR-1 achieved time savings in the 
iteration phase for 7 (out of 12) problems and DDLR-2 did so for 4 cases. 


Table 5.5 

Comparison among DDLR, pARMS and RAS preconditioners for solving general sparse sym¬ 
metric linear systems along with CG or GMRES(40). 


Matrix 

Np 

rk 

DDLR-1 
nz its p-t 

i-t 

rk 

nz 

DDLR-2 
its p-t 

i-t 

Andrews 

8 

8 

4.7 

33 

.587 

.220 

8 

5.2 

53 

.824 

.175 

Dubcova2 

8 

16 

3.5 

18 

.850 

.054 

16 

4.5 

44 

.856 

.079 

cfdl 

8 

8 

18.1 

17 

7.14 

.446 

8 

18.4 

217 

6.44 

2.97 

thermall 

8 

16 

6.0 

48 

.493 

.145 

16 

8.3 

126 

.503 

.234 

cfd2 

16 

8 

13.2 

12 

4.93 

.232 

8 

13.4 

F 

5.11 

- 

DubcovaS 

16 

16 

2.6 

16 

1.70 

.061 

16 

3.2 

44 

1.71 

.107 

thermo_TK 

16 

32 

6.4 

24 

.568 

.050 

32 

10.8 

63 

.537 

.096 

paraJem 

16 

32 

7.8 

59 

4.02 

.777 

32 

12.3 

159 

4.12 

1.35 

tmt_sym 

16 

16 

7.3 

33 

5.56 

.668 

16 

9.5 

62 

5.69 

.790 

ecology2 

32 

32 

8.9 

39 

3.67 

.433 

32 

15.2 

89 

3.79 

.709 

ecology 1 

32 

32 

8.8 

40 

3.48 

.423 

32 

15.1 

82 

3.59 

.656 

thermal2 

32 

32 

6.8 

140 

5.06 

2.02 

32 

11.3 

F 

5.11 

- 


Matrix 

Np 

pARMS 
nz its p-t 

i-t 

nz 

RAS 

its 

p-t 

i-t 

Andrews 

8 

4.3 

15 

.217 

.109 

3.6 

19 

.010 

.073 

Dubcova2 

8 

3.5 

25 

.083 

.090 

3.5 

43 

.008 

0.11 

cfdl 

8 

16.1 

F 

.091 

- 

10.6 

153 

.013 

3.55 

thermall 

8 

5.4 

39 

.089 

.153 

4.6 

156 

.006 

.235 

cfd2 

16 

26.0 

F 

.120 

- 

11.9 

310 

.012 

3.26 

Dubcova3 

16 

2.6 

37 

.130 

.200 

4.2 

39 

.013 

.212 

thermo_TK 

16 

4.9 

16 

.048 

.035 

5.5 

34 

.004 

.067 

para_fem 

16 

6.5 

89 

.586 

1.36 

5.1 

247 

.019 

1.18 

tmt_sym 

16 

6.9 

16 

.587 

.361 

3.7 

26 

.026 

.222 

ecology2 

32 

9.9 

15 

.662 

.230 

5.8 

28 

.017 

.165 

ecology 1 

32 

10.0 

14 

.664 

.220 

5.8 

27 

.017 

.161 

thermal2 

32 

6.1 

205 

.547 

3.70 

4.7 

F 

.025 

- 


6. Conclusion. This paper presented a preconditioning method for solving dis¬ 
tributed symmetric sparse linear systems, based on an approximate inverse of the 
original matrix which exploits the domain decomposition method and low-rank ap¬ 
proximations. Two low-rank approximation strategies are discussed, called DDLR-1 
and DDLR-2. In terms of the number of iterations and iteration time, experimental 
results indicate that for SPD systems, the DDLR-1 preconditioner can be an effi¬ 
cient alternative to other domain decomposition-type approaches such as one based 
on distributed Schur complements (as in pARMS), or on the RAS preconditioner. 
Moreover, this preconditioner appears to be more robust than the pARMS method 
and the RAS method for indefinite problems. 

The DDLR preconditioners require more time to build than other DD-based pre¬ 
conditioners. However, one must take a number of other factors into account. First, 
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some improvements can be made to reduce the set-up time. For example, more ef¬ 
ficient local solvers such as ARMS can be used instead of the current ILUs; vector 
processing such as GPU computing can accelerate the computations of the low-rank 
corrections; and more efficient algorithms than the Lanczos method, e.g., randomized 
techniques [18], can be exploited for computing the eigenpairs. Second, there are 
many applications in which many systems with the same matrix must be solved. In 
this case more expensive but more effective preconditioners may be justified because 
their cost can be amortized. Finally, another important factor touched upon briefly 
in Section 4.7 is that the preconditioners discussed here are more easily updatable 
than traditional ILU-type or DD-type preconditioners. 
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